The goal of this notebook is to implement a logistic regression classifier using stochastic gradient ascent. You will:
import graphlab as gl
print('gl.version: %s' % (gl.version))
gl.canvas.set_target('ipynb')
from __future__ import division
import math
import string
# my imports
import pandas as pd
import numpy as np
from six.moves import cPickle as pickle
from types import MethodType
def value_counts( self ):
import pandas as pd
pdDf = self.to_dataframe()
for ftr in pdDf.columns:
print(pdDf[ftr].value_counts())
#SFrame.value_counts = MethodType(value_counts, None, SFrame)
#setattr(SFrame, 'value_counts', value_counts)
#setattr(glbObsAll, 'value_counts', value_counts)
def SFrameShow( self , catClmn = None, title = None):
if (title != None):
print('%s Summary:' % (title))
print(self.shape)
self.show()
if catClmn != None:
for clmn in catClmn:
print(' %s category summary:' % (clmn))
glbObsAll[clmn].show(view = 'Categorical')
print(self)
For this assignment, we will use the same subset of the Amazon product review dataset that we used in Module 3 assignment. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted of mostly positive reviews.
glbObsAll = gl.SFrame('data/amazon_baby_subset.gl/')
SFrameShow(glbObsAll, catClmn = ['rating', 'sentiment'], title = 'glbObsAll')
Just like we did previously, we will work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:
Refer to Module 3 assignment for more details.
import json
with open('data/important_words.json', 'r') as f:
impWrd = json.load(f)
impWrd = [str(s) for s in impWrd]
print impWrd
# Remote punctuation
def rmvPunctuation(text):
import string
return text.translate(None, string.punctuation)
glbObsAll['review.Cln'] = glbObsAll['review'].apply(rmvPunctuation)
# Split out the words into individual columns
for word in impWrd:
glbObsAll[word] = glbObsAll['review.Cln'].apply(lambda s : s.split().count(word))
The SFrame glbObsAll now contains one column for each of the 193 impWrd.
# Exclude impWrd for show
SFrameShow(glbObsAll[list(set(glbObsAll.column_names()) - set(impWrd))],
catClmn = ['rating', 'sentiment'], title = 'glbObsAll excl. impWrd')
We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the validation set. We use seed=1 so that everyone gets the same result.
glbObsFit, glbObsOOB = glbObsAll.random_split(.9, seed=1)
print 'Training set : %d data points' % len(glbObsFit)
print 'Validation set: %d data points' % len(glbObsOOB)
Just like in the earlier assignments, we provide you with a function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels.
Note: The feature matrix includes an additional column '.intercept' filled with 1's to take account of the intercept term.
import numpy as np
def toNdarray(data_sframe, features, label):
data_sframe['.intercept'] = 1
features = ['.intercept'] + features
features_sframe = data_sframe[features]
ftrMtrx = features_sframe.to_numpy()
label_sarray = data_sframe[label]
label_array = label_sarray.to_numpy()
return(ftrMtrx, label_array)
Note that we convert both the training and validation sets into NumPy arrays.
Warning: This may take a few minutes.
glbObsFitFtrMtrx, glbObsFitRsp = toNdarray(glbObsFit, impWrd, 'sentiment')
glbObsOOBFtrMtrx, glbObsOOBRsp = toNdarray(glbObsOOB, impWrd, 'sentiment')
Are you running this notebook on an Amazon EC2 t2.micro instance? (If you are using your own machine, please skip this section)
It has been reported that t2.micro instances do not provide sufficient power to complete the conversion in acceptable amount of time. For interest of time, please refrain from running toNdarray function. Instead, download the binary file containing the four NumPy arrays you'll need for the assignment. To load the arrays, run the following commands:
arrays = np.load('module-10-assignment-numpy-arrays.npz')
glbObsFitFtrMtrx, glbObsFitRsp = arrays['glbObsFitFtrMtrx'], arrays['glbObsFitRsp']
glbObsOOBFtrMtrx, glbObsOOBRsp = arrays['glbObsOOBFtrMtrx'], arrays['glbObsOOBRsp']
Quiz question: In Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does the changing the solver to stochastic gradient ascent affect the number of features?
Let us now build on Module 3 assignment. Recall from lecture that the link function for logistic regression can be defined as:
$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of impWrd in the review $\mathbf{x}_i$.
We will use the same code as in Module 3 assignment to make probability predictions, since this part is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.
'''
produces probablistic estimate for P(y_i = +1 | x_i, w).
estimate ranges between 0 and 1.
'''
def predctProbability(ftrMtrx, coefficients):
# Take dot product of ftrMtrx and coefficients
score = np.dot(ftrMtrx, coefficients)
# Compute P(y_i = +1 | x_i, w) using the link function
predictions = 1. / (1.+np.exp(-score))
return predictions
Let us now work on making minor changes to how the derivative computation is performed for logistic regression.
Recall from the lectures and Module 3 assignment that for logistic regression, the derivative of log likelihood with respect to a single coefficient is as follows:
$$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$In Module 3 assignment, we wrote a function to compute the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts the following two parameters:
errors vector containing $(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w}))$ for all $i$feature vector containing $h_j(\mathbf{x}_i)$ for all $i$Complete the following code block:
def getFeatureDerivative(errors, feature):
# Compute the dot product of errors and feature
## YOUR CODE HERE
derivative = np.dot(errors, feature)
return derivative
Note. We are not using regularization in this assignment, but, as discussed in the optional video, stochastic gradient can also be used for regularized logistic regression.
To verify the correctness of the gradient computation, we provide a function for computing average log likelihood (which we recall from the last assignment was a topic detailed in an advanced optional video, and used here for its numerical stability).
To track the performance of stochastic gradient ascent, we provide a function for computing average log likelihood.
$$\ell\ell_A(\mathbf{w}) = \color{red}{\frac{1}{N}} \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$Note that we made one tiny modification to the log likelihood function (called compute_log_likelihood) in our earlier assignments. We added a $\color{red}{1/N}$ term which averages the log likelihood accross all data points. The $\color{red}{1/N}$ term makes it easier for us to compare stochastic gradient ascent with batch gradient ascent. We will use this function to generate plots that are similar to those you saw in the lecture.
def getAvgLogLikelihood(ftrMtrx, rspVctr, coefficients):
indicator = (rspVctr==+1)
scores = np.dot(ftrMtrx, coefficients)
logexp = np.log(1. + np.exp(-scores))
# Simple check to prevent overflow
mask = np.isinf(logexp)
logexp[mask] = -scores[mask]
lp = np.sum((indicator-1)*scores - logexp)/len(ftrMtrx)
return lp
Quiz Question: Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by
$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$How are the functions $\ell\ell(\mathbf{w})$ and $\ell\ell_A(\mathbf{w})$ related?
Recall from the lecture that the gradient for a single data point $\color{red}{\mathbf{x}_i}$ can be computed using the following formula:
$$ \frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{\partial w_j} = h_j(\color{red}{\mathbf{x}_i})\left(\mathbf{1}[y_\color{red}{i} = +1] - P(y_\color{red}{i} = +1 | \color{red}{\mathbf{x}_i}, \mathbf{w})\right) $$Computing the gradient for a single data point
Do we really need to re-write all our code to modify $\partial\ell(\mathbf{w})/\partial w_j$ to $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$?
Thankfully No!. Using NumPy, we access $\mathbf{x}_i$ in the training data using glbObsFitFtrMtrx[i:i+1,:]
and $y_i$ in the training data using glbObsFitRsp[i:i+1]. We can compute $\partial\ell_{\color{red}{i}}(\mathbf{w})/\partial w_j$ by re-using all the code written in getFeatureDerivative and predctProbability.
We compute $\partial\ell_{\color{red}{i}}(\mathbf{w})/\partial w_j$ using the following steps:
glbObsFitFtrMtrx[i:i+1,:] as the first parameter.glbObsFitRsp[i:i+1].glbObsFitFtrMtrx[i:i+1, j] as one of the parameters. Let us follow these steps for j = 1 and i = 10:
j = 1 # Feature number
i = 10 # Data point number
coefficients = np.zeros(194) # A point w at which we are computing the gradient.
predictions = predctProbability(glbObsFitFtrMtrx[i:i+1,:], coefficients)
indicator = (glbObsFitRsp[i:i+1]==+1)
errors = indicator - predictions
gradient_single_data_point = getFeatureDerivative(errors, glbObsFitFtrMtrx[i:i+1,j])
print "Gradient single data point: %s" % gradient_single_data_point
print " --> Should print 0.0"
Quiz Question: The code block above computed $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$ for j = 1 and i = 10. Is $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$ a scalar or a 194-dimensional vector?
Stochastic gradient estimates the ascent direction using 1 data point, while gradient uses $N$ data points to decide how to update the the parameters. In an optional video, we discussed the details of a simple change that allows us to use a mini-batch of $B \leq N$ data points to estimate the ascent direction. This simple approach is faster than regular gradient but less noisy than stochastic gradient that uses only 1 data point. Although we encorage you to watch the optional video on the topic to better understand why mini-batches help stochastic gradient, in this assignment, we will simply use this technique, since the approach is very simple and will improve your results.
Given a mini-batch (or a set of data points) $\mathbf{x}_{i}, \mathbf{x}_{i+1} \ldots \mathbf{x}_{i+B}$, the gradient function for this mini-batch of data points is given by: $$ \color{red}{\sum_{s = i}^{i+B}} \frac{\partial\ell_{s}}{\partial w_j} = \color{red}{\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) $$
Computing the gradient for a "mini-batch" of data points
Using NumPy, we access the points $\mathbf{x}_i, \mathbf{x}_{i+1} \ldots \mathbf{x}_{i+B}$ in the training data using glbObsFitFtrMtrx[i:i+B,:]
and $y_i$ in the training data using glbObsFitRsp[i:i+B].
We can compute $\color{red}{\sum_{s = i}^{i+B}} \partial\ell_{s}/\partial w_j$ easily as follows:
j = 1 # Feature number
i = 10 # Data point start
B = 10 # Mini-batch size
coefficients = np.zeros(194) # A point w at which we are computing the gradient.
predictions = predctProbability(glbObsFitFtrMtrx[i:i+B,:], coefficients)
indicator = (glbObsFitRsp[i:i+B]==+1)
errors = indicator - predictions
gradient_mini_batch = getFeatureDerivative(errors, glbObsFitFtrMtrx[i:i+B,j])
print "Gradient mini-batch data points: %s" % gradient_mini_batch
print " --> Should print 1.0"
Quiz Question: The code block above computed
$\color{red}{\sum_{s = i}^{i+B}}\partial\ell_{s}(\mathbf{w})/{\partial w_j}$
for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?
Quiz Question: For what value of B is the term
$\color{red}{\sum_{s = 1}^{B}}\partial\ell_{s}(\mathbf{w})/\partial w_j$
the same as the full gradient
$\partial\ell(\mathbf{w})/{\partial w_j}$?
It is a common practice to normalize the gradient update rule by the batch size B:
$$ \frac{\partial\ell_{\color{red}{A}}(\mathbf{w})}{\partial w_j} \approx \color{red}{\frac{1}{B}} {\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) $$In other words, we update the coefficients using the average gradient over data points (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of all the data points), and study the effect of batch size on the algorithm as well as the choice of step size.
Now we are ready to implement our own logistic regression with stochastic gradient ascent. Complete the following function to fit a logistic regression model using gradient ascent:
from math import sqrt
def fitLogisticRSGMdl(ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter):
logLikelihoodVctr = []
# make sure it's a numpy array
coefficients = np.array(initCff)
# set seed=1 to produce consistent results
np.random.seed(seed=1)
# Shuffle the data before starting
permutation = np.random.permutation(len(ftrMtrx))
ftrMtrx = ftrMtrx[permutation,:]
rspVctr = rspVctr[permutation]
i = 0 # index of current batch
# Do a linear scan over data
for thsIter in xrange(maxIter):
# Predict P(y_i = +1|x_i,w) using your predctProbability() function
# Make sure to slice the i-th row of ftrMtrx with [i:i+batchSize,:]
### YOUR CODE HERE
predictions = predctProbability(ftrMtrx[i:i+batchSize,:], coefficients)
# Compute indicator value for (y_i = +1)
# Make sure to slice the i-th entry with [i:i+batchSize]
### YOUR CODE HERE
indicator = (rspVctr[i:i+batchSize] == +1)
# Compute the errors as indicator - predictions
errors = indicator - predictions
for j in xrange(len(coefficients)): # loop over each coefficient
# Recall that ftrMtrx[:,j] is the feature column associated with coefficients[j]
# Compute the derivative for coefficients[j] and save it to derivative.
# Make sure to slice the i-th row of ftrMtrx with [i:i+batchSize,j]
### YOUR CODE HERE
derivative = getFeatureDerivative(errors, ftrMtrx[i:i+batchSize,j])
# compute the product of the step size, the derivative, and the **normalization constant** (1./batchSize)
### YOUR CODE HERE
coefficients[j] += stepSize * derivative / batchSize
# Checking whether log likelihood is increasing
# Print the log likelihood over the *current batch*
lp = getAvgLogLikelihood(ftrMtrx[i:i+batchSize,:], rspVctr[i:i+batchSize],
coefficients)
logLikelihoodVctr.append(lp)
if (thsIter <= 10) or \
(thsIter <= 100 and thsIter % 10 == 0) or \
(thsIter <= 1000 and thsIter % 100 == 0) or \
(thsIter <= 10000 and thsIter % 1000 == 0) or \
(thsIter % 10000 == 0) or \
(thsIter == maxIter-1):
nObs = len(ftrMtrx)
print 'Iteration %*d: Average log likelihood (of data points in batch [%0*d:%0*d]) = %.8f' % \
(int(np.ceil(np.log10(maxIter))), thsIter, \
int(np.ceil(np.log10(nObs))), i, \
int(np.ceil(np.log10(nObs))), i+batchSize, lp)
# if we made a complete pass over data, shuffle and restart
i += batchSize
if i+batchSize > len(ftrMtrx):
permutation = np.random.permutation(len(ftrMtrx))
ftrMtrx = ftrMtrx[permutation,:]
rspVctr = rspVctr[permutation]
i = 0
# We return the list of log likelihoods for plotting purposes.
return coefficients, logLikelihoodVctr
Note. In practice, the final set of coefficients is rarely used; it is better to use the average of the last K sets of coefficients instead, where K should be adjusted depending on how fast the log likelihood oscillates around the optimum.
The following cell tests your stochastic gradient ascent function using a toy dataset consisting of two data points. If the test does not pass, make sure you are normalizing the gradient update rule correctly.
sample_ftrMtrx = np.array([[1.,2.,-1.], [1.,0.,1.]])
sample_rspVctr = np.array([+1, -1])
coefficients, log_likelihood = fitLogisticRSGMdl(sample_ftrMtrx, sample_rspVctr, np.zeros(3),
stepSize=1., batchSize=2, maxIter=2)
print '-------------------------------------------------------------------------------------'
print 'Coefficients learned :', coefficients
print 'Average log likelihood per-iteration :', log_likelihood
if np.allclose(coefficients, np.array([-0.09755757, 0.68242552, -0.7799831]), atol=1e-3)\
and np.allclose(log_likelihood, np.array([-0.33774513108142956, -0.2345530939410341])):
# pass if elements match within 1e-3
print '-------------------------------------------------------------------------------------'
print 'Test passed!'
else:
print '-------------------------------------------------------------------------------------'
print 'Test failed'
For the remainder of the assignment, we will compare stochastic gradient ascent against batch gradient ascent. For this, we need a reference implementation of batch gradient ascent. But do we need to implement this from scratch?
Quiz Question: For what value of batch size B above is the stochastic gradient ascent function fitLogisticRSGMdl act as a standard gradient ascent algorithm?
glbMdlDf = pd.DataFrame()
# fitLogisticRSGMdl should be defined as LogisticRSG ??? & myfitMdl shd be fitLogisticRSGMdl ???
# try dicts for thsMdlDf['cff'] & thsMdlDf['logLhood']
def myfitMdl(mdlDf, ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter):
import datetime
if (mdlDf.shape[0] == 0) or \
(mdlDf[(mdlDf.stepSize == stepSize) &
(mdlDf.batchSize == batchSize) &
(mdlDf.maxIter == maxIter) &
True].shape[0] == 0):
thsMdlDf = pd.DataFrame({
'stepSize' : stepSize ,
'batchSize': batchSize,
'maxIter' : maxIter
}, index = [0])
thsMdlDf.set_index(['stepSize', 'batchSize', 'maxIter'], drop = False, inplace = True)
#set_index(self, keys, drop=True, append=False, inplace=False, verify_integrity=False)
startTm = datetime.datetime.now()
print('\nrunning fitLogisticRSGMdl: stepSize: %.4e; batchSize: %d; maxIter: %d' %
(stepSize, batchSize, maxIter))
mdlCff, mdlLogLhood = \
fitLogisticRSGMdl(ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter)
thsMdlDf['cff'] = mdlCff.tostring()
thsMdlDf['logLhood'] = str(mdlLogLhood)
thsMdlDf['elapsedSecs'] = (datetime.datetime.now() - startTm).seconds
#glbMdlDf = glbMdlDf.append(mdlDf)
return(thsMdlDf)
Instead of implementing batch gradient ascent separately, we save time by re-using the stochastic gradient ascent function we just wrote — to perform gradient ascent, it suffices to set batchSize to the number of data points in the training data. Yes, we did answer above the quiz question for you, but that is an important point to remember in the future :)
Small Caveat. The batch gradient ascent implementation here is slightly different than the one in the earlier assignments, as we now normalize the gradient update rule.
We now run stochastic gradient ascent over the glbObsFitFtrMtrx for 10 iterations using:
initCff = np.zeros(194)stepSize = 5e-1batchSize = 1maxIter = 10smpCff, smpLogLhood = fitLogisticRSGMdl(glbObsFitFtrMtrx, glbObsFitRsp,
initCff=np.zeros(194),
stepSize=5e-1, batchSize=1, maxIter=10)
Quiz Question. When you set batchSize = 1, as each iteration passes, how does the average log likelihood in the batch change?
print(type(smpCff))
#print(smpCff.tostring())
print(type(smpLogLhood))
print(str(smpLogLhood))
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
initCff=np.zeros(194),
stepSize=5e-1, batchSize=1, maxIter=10))
print(glbMdlDf)
Now run batch gradient ascent over the glbObsFitFtrMtrx for 200 iterations using:
initCff = np.zeros(194)stepSize = 5e-1batchSize = len(glbObsFitFtrMtrx)maxIter = 200# YOUR CODE HERE
#coefficients_batch, log_likelihood_batch = ...
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
initCff = np.zeros(194),
stepSize = 5e-1, batchSize = len(glbObsFitFtrMtrx), maxIter = 200))
print(glbMdlDf)
Quiz Question. When you set batchSize = len(glbObsFit), as each iteration passes, how does the average log likelihood in the batch change?
To make a fair comparison betweeen stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes (defined as follows): $$ [\text{# of passes}] = \frac{[\text{# of data points touched so far}]}{[\text{size of dataset}]} $$
Quiz Question Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?
print(2. * 50000 / 100)
With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use
stepSize=1e-1batchSize=100initCff to all zeros.stepSize = 1e-1
batchSize = 100
num_passes = 10
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)
# coefficients_sgd, log_likelihood_sgd = fitLogisticRSGMdl(glbObsFitFtrMtrx, glbObsFitRsp,
# initCff=np.zeros(194),
# stepSize=1e-1, batchSize=100, maxIter=num_iterations)
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
initCff=np.zeros(194),
stepSize=1e-1, batchSize=100, maxIter=num_iterations))
print(glbMdlDf)
We provide you with a utility function to plot the average log likelihood as a function of the number of passes.
import matplotlib.pyplot as plt
%matplotlib inline
def pltLogLhood(logLikelihoodVctr, len_data, batchSize, smoothing_window=1, label=''):
plt.rcParams.update({'figure.figsize': (9,5)})
logLikelihoodVctr_ma = np.convolve(np.array(logLikelihoodVctr), \
np.ones((smoothing_window,))/smoothing_window, mode='valid')
plt.plot(np.array(range(smoothing_window-1, len(logLikelihoodVctr)))*float(batchSize)/len_data,
logLikelihoodVctr_ma, linewidth=4.0, label=label)
plt.rcParams.update({'font.size': 16})
plt.tight_layout()
plt.xlabel('# of passes over data')
plt.ylabel('Average log likelihood per data point')
plt.legend(loc='lower right', prop={'size':14})
#[float(x) for x in tstStr.strip("[]").split(',')]
pltLogLhood([float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == 1e-1) &
(glbMdlDf.batchSize == 100 ) &
(glbMdlDf.maxIter == 4770)
]['logLhood'][0].strip("[]").split(',')],
len_data=len(glbObsFitFtrMtrx), batchSize=100,
label='stochastic gradient, stepSize=1e-1')
The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter smoothing_window. The smoothing is simply a moving average of log likelihood over the last smoothing_window "iterations" of stochastic gradient ascent.
pltLogLhood([float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == 1e-1) &
(glbMdlDf.batchSize == 100 ) &
(glbMdlDf.maxIter == 4770)
]['logLhood'][0].strip("[]").split(',')],
len_data=len(glbObsFitFtrMtrx), batchSize=100,
smoothing_window=30,
label='stochastic gradient, stepSize=1e-1')
# pltLogLhood(log_likelihood_sgd, len_data=len(glbObsFitFtrMtrx), batchSize=100,
# smoothing_window=30, label='stochastic gradient, stepSize=1e-1')
Checkpoint: The above plot should look smoother than the previous plot. Play around with smoothing_window. As you increase it, you should see a smoother plot.
To compare convergence rates for stochastic gradient ascent with batch gradient ascent, we call pltLogLhood() multiple times in the same cell.
We are comparing:
stepSize = 0.1, batchSize=100stepSize = 0.5, batchSize=len(glbObsFitFtrMtrx)Write code to run stochastic gradient ascent for 200 passes using:
stepSize=1e-1batchSize=100initCff to all zeros.savMdlDf = glbMdlDf
stepSize = 1e-1
batchSize = 100
num_passes = 200
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)
## YOUR CODE HERE
#coefficients_sgd, log_likelihood_sgd = ...
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
initCff=np.zeros(194),
stepSize=1e-1, batchSize=100, maxIter=num_iterations))
print(glbMdlDf)
We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with smoothing_window=30.
sgdLogLhood = [float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == 1e-1) &
(glbMdlDf.batchSize == 100 ) &
(glbMdlDf.maxIter == 95400)
]['logLhood'][0].strip("[]").split(',')]
print(sgdLogLhood[:5])
btcLogLhood = [float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == 5e-1 ) &
(glbMdlDf.batchSize == 47780 ) &
(glbMdlDf.maxIter == 200 )
]['logLhood'][0].strip("[]").split(',')]
print(btcLogLhood[:5])
pltLogLhood(sgdLogLhood, len_data=len(glbObsFitFtrMtrx), batchSize=100,
smoothing_window=30, label='stochastic, stepSize=1e-1')
pltLogLhood(btcLogLhood, len_data=len(glbObsFitFtrMtrx), batchSize=len(glbObsFitFtrMtrx),
smoothing_window=1, label='batch, stepSize=5e-1')
Quiz Question: In the figure above, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?
In previous sections, we chose step sizes for you. In practice, it helps to know how to choose good step sizes yourself.
To start, we explore a wide range of step sizes that are equally spaced in the log space. Run stochastic gradient ascent with stepSize set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2. Use the following set of parameters:
initCff=np.zeros(194)batchSize=100maxIter initialized so as to run 10 passes over the data.savMdlDf = glbMdlDf
# batchSize = 100
# num_passes = 10
# num_iterations = ...
# coefficients_sgd = {}
# log_likelihood_sgd = {}
# for stepSize in np.logspace(-4, 2, num=7):
# coefficients_sgd[stepSize], log_likelihood_sgd[stepSize] = ...
batchSize = 100
num_passes = 10
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)
for stepSize in np.logspace(-4, 2, num=7):
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
initCff=np.zeros(194),
stepSize = stepSize, batchSize = batchSize, maxIter=num_iterations))
print(glbMdlDf)
Now, we will plot the change in log likelihood using the pltLogLhood for each of the following values of stepSize:
stepSize = 1e-4stepSize = 1e-3stepSize = 1e-2stepSize = 1e-1stepSize = 1e0stepSize = 1e1stepSize = 1e2For consistency, we again apply smoothing_window=30.
for stepSize in np.logspace(-4, 2, num=7):
sgdLogLhood = [float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == stepSize) &
(glbMdlDf.batchSize == 100 ) &
(glbMdlDf.maxIter == 4770)
]['logLhood'][0].strip("[]").split(',')]
# pltLogLhood(log_likelihood_sgd[stepSize], len_data=len(glbObsFit), batchSize=100,
# smoothing_window=30, label='stepSize=%.1e'%stepSize)
pltLogLhood(sgdLogLhood, len_data=len(glbObsFit), batchSize=100,
smoothing_window=30, label='stepSize=%.1e'%stepSize)
Now, let us remove the step size stepSize = 1e2 and plot the rest of the curves.
for stepSize in np.logspace(-4, 2, num=7)[0:6]:
sgdLogLhood = [float(ll) for ll in
glbMdlDf[(glbMdlDf.stepSize == stepSize) &
(glbMdlDf.batchSize == 100 ) &
(glbMdlDf.maxIter == 4770)
]['logLhood'][0].strip("[]").split(',')]
# pltLogLhood(log_likelihood_sgd[stepSize], len_data=len(glbObsFit), batchSize=100,
# smoothing_window=30, label='stepSize=%.1e'%stepSize)
pltLogLhood(sgdLogLhood, len_data=len(glbObsFit), batchSize=100,
smoothing_window=30, label='stepSize=%.1e'%stepSize)
Quiz Question: Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.
Quiz Question: Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.